Communication Pattern Based Checkpointing Coordination for Fault-tolerant Distributed Computing Systems

نویسندگان

  • Taesoon Park
  • Heon Y. Yeom
چکیده

This paper presents a new checkpointing coordination scheme which utilizes the communication pattern of the cooperating processes. In the proposed scheme, the checkpointing is coordinated for the limited number of processes based on the information regarding the communication pattern of the target program. Unlike the previous solutions which do not utilize the communication pattern, it is possible to reduce the coordination eeort as well as the checkpointing frequency. Extensive simulation has been performed to evaluate the performance of the proposed scheme and we concluded that the proposed scheme signiicantly reduces the checkpointing overhead compared with the loose coordination schemes.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Application controlled checkpointing coordination for fault-tolerant distributed computing systems

In order to provide fault tolerance for distributed systems, the checkpointing technique has widely been used and many researches have been performed to reduce the overhead of check-pointing coordination. In this paper, we present a new checkpointing coordination scheme in which the application controls the coordination activity by utilizing the communication pattern of the application program....

متن کامل

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...

متن کامل

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...

متن کامل

New Paradigms in Check-pointing Techniques in Distributed Mobile Systems

Distributed systems today are ubiquitous and enable many applications, including client–server systems, transaction processing, the World Wide Web, and scientific computing, among many others. Distributed systems are not fault-tolerant and the vast computing potential of these systems is often hampered by their susceptibility to failures. Many techniques have been developed to add reliability a...

متن کامل

Computing in the RAIN: a reliable array of independent nodes - Parallel and Distributed Systems, IEEE Transactions on

ÐThe RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007